"Superluminal" FITS File Processing on Multiprocessors: Zero Time Endian Conversion Technique
ABSTRACT

FITS is the standard file format in astronomy, and it has been extended to meet the astronomical needs of the day. However, astronomical datasets have been growing year by year. In the case of the ALMA telescope, a ~TB-scale 4-dimensional data cube may be produced for one target. Considering that typical Internet bandwidth is a few tens of MB/s at most, the original data cubes in FITS format are hosted on a VO server, and the region in which a user is interested should be cut out and transferred to the user (Eguchi et al. 2012). The system will be equipped with a very high-speed disk array to process a TB-scale data cube in a few tens of seconds, so the disk I/O speed, the endian conversion speed, and the data processing speed will be comparable. Hence reducing the endian conversion time is one of the issues to be solved to realize our system. In this paper, I introduce a technique named "just-in-time endian conversion", which delays the endian conversion of each pixel until just before the value is really needed, to sweep out the endian conversion time; by applying this method, the FITS processing speed increases by 20% for single threading and by 40% for multi-threading compared to CFITSIO. The speed-up is closely related to modern CPU architecture: breaking the "causality" of the programmed instruction sequence improves the efficiency of the instruction pipelines.



Subject headings: Astronomical databases: miscellaneous; Virtual observatory tools; Methods: data analysis

1. Introduction



The Flexible Image Transport System (FITS) is the standard data format for astronomical observational data, whether or not they are products of calibration pipelines. One FITS file can store multiple CCD images and photon event lists as tables, and this feature has made the FITS format prevail from the radio band to the X-ray band. In particular, most archival datasets and source catalogs are provided as FITS files these days.

The original purpose of the FITS format was to transport digital astronomical images from one computer to another on magnetic tape (Wells et al. 1981). There was no unified standard for computers at that time, and the bit sizes assigned to a character and an integer differed considerably from one model to another, even among models from the same maker. Thus the authors had to create a new machine-independent and future-expandable image format for data exchange, FITS. Since then the FITS format has been repeatedly extended to meet the astronomical needs of the day (e.g., Greisen & Harten 1981; Grosbøl et al. 1988).
However, in the years ahead we will face the issue of astronomical data inflation rather than that of the format. The Atacama Large Millimeter/submillimeter Array (ALMA), the largest radio telescope, built on the Chajnantor plateau in northern Chile, started observations last year. ALMA is estimated to generate ~200 TB of raw observational data every year, and the volume of a processed 4-dimensional data cube for one target may exceed 2 TB (Lucas et al. 2004). Furthermore, the Large Synoptic Survey Telescope (LSST), a project for the 2020s, will generate 30 TB of data every night. We need a system that helps astronomers find something of interest in such big data.

Looking toward such a future, the National Astronomical Observatory of Japan has been developing a large data providing system for ALMA that utilizes the technology of the Virtual Observatory (VO) to share our outputs with the global astronomical communities; all processed datasets (FITS files) are hosted on a VO server, and a user can select a cutout region to download through a web-based graphical user interface (Eguchi et al. 2012, Paper I hereafter).

A prototype service is already public, and I am now working on its optimization. The system has to process a TB-scale data cube in a few tens of seconds for users' convenience; it is therefore planned to be equipped with a very high-speed disk array, so that the disk I/O speed and the data processing speed will be comparable. All the components of the system are built on the Intel platform, which is little endian, while the FITS format is big endian. For an interactive TB-size FITS file processing system, the endian conversion time is not negligible.

In this paper, I introduce a technique that makes the endian conversion time apparently disappear and makes the system much faster by multiprocessing. I describe the hardware and software configuration used for the evaluation in Section 2, and compare endian conversion algorithms and their performance in Section 3. In Section 4, I examine the best timing for endian conversion, and I discuss the performance increase due to the conversion timing in Section 5. Throughout the paper, I repeated each measurement 100 times and adopted the sample standard deviation (the square root of the unbiased variance) as the 1-σ statistical error, ignoring any systematic errors.
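For reference, the error definition above can be written out explicitly. This is only a restatement, assuming the n = 100 measurements of one item are denoted t_1, ..., t_n; the symbols t_i, \bar{t}, and s are introduced here for illustration and are not used elsewhere in the paper.

```latex
\bar{t} = \frac{1}{n}\sum_{i=1}^{n} t_i , \qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(t_i - \bar{t}\right)^{2}} , \qquad
n = 100 .
```

Each quoted value then has the form \bar{t} ± s.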
^A system which consists of 16 striped solid state disks (SSDs) from the consumer products market effectively reaches ~4 GB/s read/write performance.



2. Configuration and Test Data 

Table 1 shows the hardware and software configuration used for verification of the method. I used two types of CPUs, Intel Core i7-2600 (for Machine A) and AMD FX-8350 (for Machine B), to prevent bias due to microarchitecture. Throughout the paper, Intel Turbo Boost Technology (the former) and AMD Turbo CORE Technology (the latter) are disabled in the BIOS for simplicity. In addition, Intel Hyper-Threading Technology (the former) is also disabled for the same reason. Thus Machines A and B have 4 and 8 physical processors available, respectively. The memory bandwidths and storage speeds were obtained with the following commands: dd if=/dev/zero of=/dev/null bs=1G count=100 and hdparm -t (device), respectively.

The same software is installed on both computers: Ubuntu 12.04.1 LTS (amd64), a Debian-based 64-bit Linux, as the operating system, GNU Compiler Collection (GCC) Version 4.6 as the C/C++ compiler (gcc/g++), and CFITSIO Version 3.310 as the C-language FITS library (Pence 2010). I applied the -O2 -pipe -Wall compile options to CFITSIO and to the programs used in the paper. The Streaming SIMD Extensions 2 (SSE2) code in CFITSIO was enabled since I built the library on 64-bit Linux, but the SSSE3 option was disabled since the SSSE3 instruction set is treated as an extension in the amd64 environment.

^Single Instruction/Multiple Data

^There is no way to undefine the __SSE2__ macro with 64-bit GCC; this macro switches between the SSE2 code and the fallback code in CFITSIO.

Fig. 1. Test FITS image to evaluate parallelization efficiencies: a false color mosaic image of the Carina Nebula obtained with the Hubble Space Telescope, consisting of 29,566 × 14,321 pixels (3.4 GB).

Table 1
The hardware and software information used for the evaluations

                    Machine A                       Machine B
CPU                 Intel Core i7-2600 (3.4 GHz)    AMD FX-8350 (4.0 GHz)
RAM                 8 GB (8.80 ± 0.03 GB/s)         16 GB (22.04 ± 0.07 GB/s)
Storage             SSD (Read: 506.5 ± 0.7 MB/s)    HDD (Read: 143 ± 2 MB/s)
Operating System    Ubuntu 12.04.1 (amd64)
C/C++ Compiler      GNU Compiler Collection Version 4.6
FITS Library        CFITSIO Version 3.310

Note. Intel Turbo Boost and Hyper-Threading Technologies (for Machine A) and AMD Turbo CORE Technology (for Machine B) are disabled throughout the paper. Hence 4 and 8 physical processors are available for Machine A and B, respectively.

I use a false color mosaic image of the Carina Nebula obtained with the Hubble Space Telescope as test data. The image is public in Tagged Image File Format (TIFF), thus I converted it into a gray scale double precision FITS file with the convert command provided by ImageMagick. The size is 29,566 pixels in width and 14,321 pixels in height. The file volume is 3.4 GB (Figure 1). Throughout the paper, I put this FITS file on a tmpfs (Rohland 2001) mounted on /run/shm, to ensure that the file is always in memory for fast access. See Appendix A for the difference between tmpfs and ramdisk.

3. Endian Conversion Algorithms 
3.1. Formalism 

Let $(b_1, b_2, \ldots, b_8)$ be the byte sequence of the internal representation of a 64-bit value $a$. The 64-bit endian conversion of $a$ can be expressed with a permutation $\sigma$ as

$$a' = \left(b_{\sigma(1)}, b_{\sigma(2)}, \ldots, b_{\sigma(8)}\right), \tag{1}$$

where

$$\sigma = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\ 8 & 7 & 6 & 5 & 4 & 3 & 2 & 1 \end{pmatrix} \tag{2}$$

in Cauchy's two-line notation, and $\sigma^2 = 1$ (Figure 2).



3.2. Implementation 

3.2.1. Byte Shuffle: Straightforward Implementation

A straightforward implementation of Eq. (1) and Eq. (2) can be written as follows:

    uint64_t byte_shuffle(uint64_t a)
    {
        unsigned char *p = (unsigned char *)&a;
        unsigned char tmp;

        tmp = p[7]; p[7] = p[0]; p[0] = tmp;
        tmp = p[6]; p[6] = p[1]; p[1] = tmp;
        tmp = p[5]; p[5] = p[2]; p[2] = tmp;
        tmp = p[4]; p[4] = p[3]; p[3] = tmp;
        return a;
    }

I now call this method "byte shuffle". One will find a short discussion about another implementation of the byte shuffle algorithm in Appendix B.

^http://hubblesite.org/newscenter/archive/releases/2007/16/image/a/
^http://www.imagemagick.org/script/index.php

3.2.2. Bit Shift 

Another way to perform endian conversion is to use both bit shift and logical operations:

    uint64_t bit_shift(uint64_t a)
    {
        return ((a & 0x00000000000000FFULL) << 56)
             | ((a & 0x000000000000FF00ULL) << 40)
             | ((a & 0x0000000000FF0000ULL) << 24)
             | ((a & 0x00000000FF000000ULL) <<  8)
             | ((a & 0x000000FF00000000ULL) >>  8)
             | ((a & 0x0000FF0000000000ULL) >> 24)
             | ((a & 0x00FF000000000000ULL) >> 40)
             | ((a & 0xFF00000000000000ULL) >> 56);
    }

Hereafter, I call this method "bit shift".

3.2.3. BSWAP 

Intel i486 and later processors have the BSWAP instruction, which converts the endian of a given 32-bit register. The instruction is extended to accept a 64-bit register in amd64 (Intel 2012). Furthermore, GCC Version 4.3 and later have a helper function to call the instruction, and its prototype is uint64_t __builtin_bswap64(uint64_t x);. I now call endian conversions utilizing this function "BSWAP".
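As an illustration of how the builtin might be applied to FITS pixel data, the following is a minimal sketch that converts one big-endian double into a host value. It assumes a little-endian host and a GCC-compatible compiler; the function name be_to_host_double is hypothetical and does not come from the paper or from CFITSIO. It uses memcpy rather than pointer casts so that the reinterpretation between integer and floating-point types stays well defined.

```c
#include <stdint.h>
#include <string.h>

/* Convert one big-endian IEEE 754 double, as stored in a FITS file,
 * into a host double. Assumes a little-endian host.
 * be_to_host_double is a hypothetical name used only in this sketch. */
double be_to_host_double(const unsigned char bytes[8])
{
    uint64_t raw;
    double value;

    memcpy(&raw, bytes, sizeof raw);     /* load the 8 raw bytes      */
    raw = __builtin_bswap64(raw);        /* reverse the byte order    */
    memcpy(&value, &raw, sizeof value);  /* reinterpret as a double   */
    return value;
}
```

For example, the byte sequence 3F F0 00 00 00 00 00 00 is the big-endian encoding of 1.0, so passing it to this function on a little-endian machine yields 1.0.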



3.2.4. SSE2

SSE2 is a set of vector instructions for the Intel platform, which became a part of the default instruction set in the amd64 environment. Endian conversion code utilizing SSE2 can process two 64-bit values at once, and can be written as follows:

    #include <stdint.h>
    #include <emmintrin.h>

    // a must be aligned on a 16-byte boundary
    void sse2(uint64_t a[2])
    {
        __m128i r0 = _mm_load_si128((__m128i *)a);
        // r0 <- a
        __m128i r1 = _mm_srli_epi16(r0, 8);
        // 8-bit shifts towards right
        // for each 2-byte integer
        __m128i r2 = _mm_slli_epi16(r0, 8);
        // 8-bit shifts towards left
        // for each 2-byte integer
        r0 = _mm_or_si128(r1, r2);
        // 128-bit or operation
        // on r1 and r2
        r0 = _mm_shufflelo_epi16(r0,
                 _MM_SHUFFLE(0, 1, 2, 3));
        // reverse the four 2-byte words in the
        // lower half of the r0 register
        r0 = _mm_shufflehi_epi16(r0,
                 _MM_SHUFFLE(0, 1, 2, 3));
        // reverse the four 2-byte words in the
        // higher half of the r0 register
        _mm_store_si128((__m128i *)a, r0);
        // a <- r0
    }

Fig. 2. The schematic diagram of a permutation operator σ for the endian conversion of a 64-bit value. (Diagram labels: memory address from low to high; b_i: a byte value; b'_1, ..., b'_8: the bytes after conversion.)

There is almost the same code in CFITSIO and SLLIB/SFITSIO. I simply call this code "SSE2" hereafter.

3.3. SSSE3 

Another vector instruction set called "SSSE3" is available for Intel Core series and later CPUs. Utilizing this instruction set, one can perform the endian conversion of two 64-bit values with a single shuffle instruction. An example is as follows:

    #include <stdint.h>
    #include <tmmintrin.h>

    // a must be aligned on a 16-byte boundary
    void ssse3(uint64_t a[2])
    {
        const __m128i mask
            = _mm_set_epi8(
                8, 9, 10, 11, 12, 13, 14, 15,
                0, 1, 2, 3, 4, 5, 6, 7
            );
        __m128i r = _mm_load_si128((__m128i *)a);
        r = _mm_shuffle_epi8(r, mask);
        _mm_store_si128((__m128i *)a, r);
    }



^http://www.ir.isas.jaxa.jp/~cyamauch/sli/index.html







Table 2
Endian Conversion Time

Machine      Bit Shift (msec)   BSWAP (msec)   SSE2 (msec)   SSSE3 (msec)   Byte Shuffle (msec)
Machine A    410 ± 2            410 ± 2        405 ± 2       372.4 ± 0.1    3190.3 ± 0.6
Machine B    601.3 ± 0.6        605.8 ± 0.7    582.3 ± 0.7   598 ± 2        8056.3 ± 0.3

Note. The endian conversion time of 423,414,686 (= 29,566 × 14,321) double-type elements with various algorithms.



There is almost the same code in CFITSIO too. I simply call this code "SSSE3" hereafter.

3.4. Benchmark 

To see which one is the fastest and how they behave under parallelization, I performed a simple benchmark. In the benchmark, I allocated a double-type array whose number of elements was set to 29,566 × 14,321 = 423,414,686, exactly the number of pixels in Figure 1, and filled the array with uniform real random numbers of 32-bit resolution on [−1000, 1000] generated with the Mersenne Twister (Matsumoto & Nishimura 1998).



3.4.1. Single Thread

The results are summarized in Table 2. For Machine A, the three algorithms other than SSSE3 and byte shuffle process the test data in about 410 milliseconds, while SSSE3 takes about 370 milliseconds. On the other hand, for Machine B, the four algorithms other than byte shuffle process the test data in about 600 milliseconds, and the SSE2 algorithm is the fastest of all. The byte shuffle algorithm is slower than the others by one order of magnitude.

3.4.2. Multi-Thread 

I also examined the CPU scalability of these algorithms. I adopted pthread for parallelization, and simply divided the array containing the test data into equal-size segments so that the total number of segments was equal to the number of threads. Then I assigned one segment to each thread.
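The segmentation described above can be sketched as follows. This is not the benchmark program used in the paper; the names struct segment, swap_segment, parallel_swap, and MAX_THREADS are hypothetical, the byte swap uses the GCC builtin for brevity, and giving the remainder elements to the last thread is an assumption of this sketch.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_THREADS 64   /* hypothetical upper limit for this sketch */

/* One contiguous segment of the array, handed to one thread. */
struct segment {
    uint64_t *data;
    size_t    len;
};

/* Thread body: byte-swap every 64-bit word in the segment. */
static void *swap_segment(void *arg)
{
    struct segment *s = arg;
    for (size_t i = 0; i < s->len; ++i)
        s->data[i] = __builtin_bswap64(s->data[i]);
    return NULL;
}

/* Byte-swap len 64-bit words in place with nthreads threads.
 * Returns 0 on success and -1 on a bad thread count or if a thread
 * could not be created (the array may then be partly converted). */
int parallel_swap(uint64_t *data, size_t len, int nthreads)
{
    pthread_t tid[MAX_THREADS];
    struct segment seg[MAX_THREADS];
    size_t chunk, offset = 0;
    int started = 0, status = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;
    chunk = len / (size_t)nthreads;
    for (int t = 0; t < nthreads; ++t) {
        seg[t].data = data + offset;
        /* the last thread also takes the remainder elements */
        seg[t].len = (t == nthreads - 1) ? len - offset : chunk;
        offset += seg[t].len;
        if (pthread_create(&tid[t], NULL, swap_segment, &seg[t]) != 0) {
            status = -1;
            break;
        }
        ++started;
    }
    for (int t = 0; t < started; ++t)
        pthread_join(tid[t], NULL);
    return status;
}
```

Because the segments are disjoint, the threads need no locking; the only synchronization point is the final join.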

Figure 3 shows the results. I also list the observed values for detailed comparison of the algorithms in Table 3 (for Machine A) and Table 4 (for Machine B). Except for the byte shuffle algorithm, I observed a ~10% performance gain for Machine A, and a ~40% gain up to four threads for Machine B, with the four algorithms.

It seems strange that the memory bandwidth of Machine B is sufficient for the test data size and yet the four algorithms show a performance cutoff at four threads. I performed a detailed hardware benchmark utilizing LMbench, and found that the context switching time and the latency of the L2 cache memory of Machine B, normalized in CPU cycles, are 2.4 times and 4.6 times larger, respectively, than those of Machine A. Hence I conclude that there are some hardware bottlenecks in Machine B, which cause the plateau in Figure 3.

The behaviors of the four algorithms with respect to the number of threads are very similar, and I adopt the bit shift algorithm in the next section because of its compiler portability and its equivalence to BSWAP (see Appendix C).
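The equivalence of the two algorithms can at least be spot-checked at the level of results. The sketch below repeats the bit shift function of §3.2.2 and compares it with the GCC builtin on a given input; the helper name agrees_with_bswap is hypothetical, and such a check on individual values is of course not a proof and says nothing about the generated machine code.

```c
#include <stdint.h>

/* Endian conversion by bit shifts and masks, as in Section 3.2.2. */
uint64_t bit_shift(uint64_t a)
{
    return ((a & 0x00000000000000FFULL) << 56)
         | ((a & 0x000000000000FF00ULL) << 40)
         | ((a & 0x0000000000FF0000ULL) << 24)
         | ((a & 0x00000000FF000000ULL) <<  8)
         | ((a & 0x000000FF00000000ULL) >>  8)
         | ((a & 0x0000FF0000000000ULL) >> 24)
         | ((a & 0x00FF000000000000ULL) >> 40)
         | ((a & 0xFF00000000000000ULL) >> 56);
}

/* Returns 1 if bit_shift() and the BSWAP builtin agree on x.
 * agrees_with_bswap is a hypothetical helper for this sketch. */
int agrees_with_bswap(uint64_t x)
{
    return bit_shift(x) == __builtin_bswap64(x);
}
```

Applying bit_shift() twice returns the original value, which is the property σ² = 1 of Section 3.1.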

4. Endian Conversion Timing 

A modern CPU has multiple arithmetic logic units (ALUs) and instruction pipelines to boost the operating rates of the ALUs. As seen in the previous section, the hardware limit lies just below the single-thread endian conversion time (Figure 3, Machine A), which prevents CPU scalability. This may lead to many holes (or "no operation" instructions) in the pipelines and reduce the performance. If this is the case, reordering instructions in the source code can produce an improvement.

To verify this assumption, I disabled the endian conversion functionality in CFITSIO; I changed the BYTESWAPPED macros for the i386 and amd64 architectures from TRUE to FALSE in fitsio2.h, and commented out the code with which CFITSIO

^http://www.bitmover.com/lmbench/
Fig. 3. CPU-scalability comparison between the endian conversion algorithms. All the algorithms except for byte shuffle seem to behave in the same way and scale up to only half the number of CPU cores due to the hardware I/O limits.



Table 3
The CPU-scalability of the endian conversion algorithms on Machine A

Number of Threads   Bit Shift (msec)   BSWAP (msec)   SSE2 (msec)   SSSE3 (msec)   Byte Shuffle (msec)
1                   410 ± 2            411 ± 3        404 ± 2       372.4 ± 0.2    3191.8 ± 0.7
2                   366 ± 3            366 ± 3        367 ± 2       360.4 ± 0.5    1605 ± 2
3                   362 ± 3            362 ± 3        362 ± 3       354.9 ± 0.3    1082 ± 6
4                   365 ± 4            365 ± 4        365 ± 4       358.8 ± 0.5    860 ± 20

Note. The CPU-scalability of 423,414,686 (= 29,566 × 14,321) double-type element endian conversion with various algorithms on Machine A. Different from Table 2, 16-byte memory alignment is adopted.



Table 4
The CPU-scalability of the endian conversion algorithms on Machine B

Number of Threads   Bit Shift (msec)   BSWAP (msec)   SSE2 (msec)   SSSE3 (msec)   Byte Shuffle (msec)
1                   597.8 ± 0.7        605.1 ± 0.7    579.3 ± 0.9   597 ± 2        8056.2 ± 0.2
2                   484 ± 1            499 ± 1        484 ± 1       495 ± 1        4046 ± 3
3                   429.9 ± 0.8        445 ± 3        432 ± 1       440 ± 1        2699 ± 4
4                   413 ± 3            412 ± 2        413 ± 3       419 ± 7        2031 ± 8
5                   418 ± 3            417 ± 1        422 ± 2       424 ± 2        1661 ± 5
6                   412 ± 1            411.5 ± 0.8    413 ± 1       416 ± 1        1388 ± 4
7                   412.3 ± 0.9        412.2 ± 0.5    411.9 ± 0.6   414.1 ± 0.5    1192 ± 4
8                   416.1 ± 0.6        415.1 ± 0.7    415.5 ± 0.4   414.5 ± 0.4    1043 ± 2

Note. All conditions are the same as in Table 3.



performs a runtime check, in cfileio.c, to verify whether the machine endian definition by the above macro is consistent with the execution environment, and I rebuilt the library. The patches for those files are shown in Appendix E.

I compare the following two methods:

1. It loads the full test image (Figure 1) from tmpfs (see §2) onto an array, then it converts the endian by the parallelized bit shift algorithm (described in §3.4.2), and it sums up all the elements.

2. It loads the full test image onto an array, and it sums up all the elements while converting the endian of each element one after another by the bit shift algorithm.

From here, I refer to the former as the "on ahead endian conversion method", and to the latter as the "just-in-time endian conversion method".

The on ahead endian conversion method can be written as follows:

    double *v;     // an array to store
                   // a FITS image
    size_t len;    // the length of
                   // the array v

    // load a byte sequence from a FITS
    // file into v here...

    // endian conversion
    for (size_t i = 0; i < len; ++i) {
        uint64_t *p = (uint64_t *)&v[i];
        uint64_t a = bit_shift(*p);
        double *q = (double *)&a;
        v[i] = *q;
    }

    // process v here...



and the just-in-time endian conversion method can be written as follows:

    double *v;     // an array to store
                   // a FITS image
    size_t len;    // the length of
                   // the array v

    // load a byte sequence from a FITS
    // file into v here...

    // image processing...
    {
        // something...

        // one needs to refer to the value
        // of v[i] here

        // endian conversion
        uint64_t *p = (uint64_t *)&v[i];
        uint64_t a = bit_shift(*p);
        double *q = (double *)&a;
        double x = *q;

        // use x instead of v[i] below

        // something...
    }

where bit_shift() is the endian conversion function defined in §3.2.2.

In this section, I adopt summing up all the el- 
ements in the test image as an example of image 
processing. 

4.1. Single Thread 

I implemented both methods in a single thread and performed the benchmark. The code of the on ahead conversion method is as follows:

    {
        // endian conversion
        for (size_t i = 0; i < len; ++i) {
            uint64_t *p = (uint64_t *)&v[i];
            uint64_t a = bit_shift(*p);
            double *q = (double *)&a;
            v[i] = *q;
        }

        // summation
        double sum = 0.0;
        for (size_t i = 0; i < len; ++i) {
            sum += v[i];
        }
    }

and that of the just-in-time endian conversion method is as follows:

    double sum = 0.0;
    for (size_t i = 0; i < len; ++i) {
        // endian conversion
        uint64_t *p = (uint64_t *)&v[i];
        uint64_t a = bit_shift(*p);
        double *q = (double *)&a;

        // summation
        sum += *q;
    }

Note that the former code is identical to that with original CFITSIO.
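To make the difference between the two loops concrete, the following self-contained sketch implements both on a raw buffer of big-endian 64-bit words. It is an illustration rather than the benchmark code: the function names sum_on_ahead and sum_just_in_time are hypothetical, the GCC builtin stands in for bit_shift(), the buffer is typed as uint64_t, and memcpy replaces the pointer casts so that the reinterpretation is well defined.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of a 64-bit word (same role as bit_shift()). */
static uint64_t swap64(uint64_t a)
{
    return __builtin_bswap64(a);
}

/* On ahead: convert the whole buffer in place, then sum it.
 * The buffer is left in host byte order afterwards. */
double sum_on_ahead(uint64_t *raw, size_t len)
{
    double sum = 0.0;
    for (size_t i = 0; i < len; ++i)
        raw[i] = swap64(raw[i]);
    for (size_t i = 0; i < len; ++i) {
        double x;
        memcpy(&x, &raw[i], sizeof x);
        sum += x;
    }
    return sum;
}

/* Just-in-time: convert each element right before it is used.
 * The buffer itself stays in file (big-endian) byte order. */
double sum_just_in_time(const uint64_t *raw, size_t len)
{
    double sum = 0.0;
    for (size_t i = 0; i < len; ++i) {
        uint64_t a = swap64(raw[i]);
        double x;
        memcpy(&x, &a, sizeof x);
        sum += x;
    }
    return sum;
}
```

The first function traverses the array twice and writes every element back to memory, whereas the second traverses it once and never writes to the buffer.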

The results are summarized in Table 5. With the on ahead endian conversion method, I obtained a slightly faster (~5%) total processing time of 2.22 ± 0.04 sec and 3.62 ± 0.04 sec for Machine A and B, respectively, while that with original CFITSIO is 2.38 ± 0.04 sec and 3.79 ± 0.03 sec for Machine A and B, respectively.

On the other hand, with the just-in-time endian conversion method, I obtained a significantly faster time of 1.85 ± 0.04 sec and 3.05 ± 0.03 sec for Machine A and B, respectively, which corresponds to a ~25% performance gain.

4.2. Multi-Thread 

I made both methods multithreaded by utilizing the OpenMP APIs because of their simple implementation. The code of the just-in-time endian conversion method, for example, is below:

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum) \
            schedule(auto)
    for (size_t i = 0; i < len; ++i) {
        // endian conversion
        uint64_t *p = (uint64_t *)&v[i];
        uint64_t a = bit_shift(*p);
        double *q = (double *)&a;

        // summation
        sum += *q;
    }

On the other hand, I could not find the best OpenMP parameters for the endian conversion routine in the on ahead conversion method; hence I applied OpenMP only to the summation routine, and adopted the pthread-based parallelization described in §3.4.2 for the endian conversion routine in the on ahead conversion method. The number of threads for OpenMP was set equal to that for the endian conversion.

The results obtained with these programs are summarized in Table 6 (for Machine A), Table 7 (for Machine B), Figure 4 (for the on ahead endian conversion method), and Figure 5 (for the just-in-time endian conversion method). Note that the endian conversion time of the on ahead endian conversion method is included in the FITS reading time. The total time to perform the same task with the original CFITSIO in a single thread is superimposed on these figures as a dotted line: 2.38 ± 0.04 seconds for Machine A, and 3.79 ± 0.03 seconds for Machine B.

For the on ahead endian conversion method, the total time scales slightly with the number of threads and becomes shorter than that of original CFITSIO, while the file reading time (including the endian conversion time) hardly scales. The scalability of the total time is mostly due to that of the summation routine, and the parallelization of the endian conversion has little impact because of the hardware limit seen in §3.4.2.

On the other hand, for the just-in-time endian conversion method, the total time is, interestingly, smaller than that of original CFITSIO even for a single thread. The summation routine seems to be scalable over almost the full range, while the total time scales up to four threads.

5. Discussion 

5.1. Performance Analysis of the Simple 
Summation Codes 

There is a well-known equation to estimate the performance increase by parallelization, Amdahl's law (Amdahl 1967):

$$T_{\mathrm{parallel}} = \left(1 - P + \frac{P}{N} + \alpha\right) T_{\mathrm{single}}, \tag{3}$$

where $T_{\mathrm{single}}$ and $T_{\mathrm{parallel}}$ represent the processing times in the single-thread and multi-thread cases, respectively, $P$ is the fraction of the code to which parallelization is applied, $N$ is the number of threads, and $\alpha$ is the overhead caused by parallelization.

^http://openmp.org/wp/

Table 5
The data processing times with the two different methods in single thread

Method                           Machine     FITS Read Time (sec)   Sum Up Time (sec)    Total Time (sec)
On Ahead Endian Conversion       Machine A   1.006 ± 0.003          0.4198 ± 0.0002      2.22 ± 0.04
                                 Machine B   1.859 ± 0.007          0.557 ± 0.001        3.62 ± 0.04
Just-in-Time Endian Conversion   Machine A   1.010 ± 0.003          0.44003 ± 0.00008    1.85 ± 0.04
                                 Machine B   1.874 ± 0.007          0.621 ± 0.002        3.06 ± 0.03

Note. The total time with original CFITSIO is 2.38 ± 0.04 and 3.79 ± 0.03 for Machine A and B, respectively.

Table 6
The CPU-scalability of on ahead and just-in-time endian conversion methods on Machine A

Method                                  Number of Threads   FITS Read Time (sec)   Sum Up Time (sec)   Total Time (sec)
On Ahead Endian Conversion Method       1                   1.445 ± 0.003          0.4205 ± 0.0001     2.38 ± 0.04
                                        2                   1.404 ± 0.003          0.220 ± 0.004       2.1 ± 0.1
                                        3                   1.401 ± 0.003          0.181 ± 0.002       2.11 ± 0.09
                                        4                   1.403 ± 0.003          0.177 ± 0.002       2.09 ± 0.03
Just-in-Time Endian Conversion Method   1                   1.048 ± 0.003          0.4416 ± 0.0009     2.0 ± 0.1
                                        2                   1.051 ± 0.003          0.2262 ± 0.0005     1.8 ± 0.1
                                        3                   1.051 ± 0.003          0.183 ± 0.001       1.77 ± 0.02
                                        4                   1.052 ± 0.003          0.177 ± 0.002       1.76 ± 0.02

Note. The endian conversion time is included in the FITS reading time for the on ahead endian conversion method.

Table 7
The CPU-scalability of on ahead and just-in-time endian conversion methods on Machine B

Method                                  Number of Threads   FITS Read Time (sec)   Sum Up Time (sec)   Total Time (sec)
On Ahead Endian Conversion Method       1                   2.526 ± 0.005          0.5522 ± 0.0006     3.76 ± 0.03
                                        2                   2.403 ± 0.005          0.299 ± 0.003       3.39 ± 0.01
                                        3                   2.347 ± 0.005          0.220 ± 0.002       3.253 ± 0.010
                                        4                   2.326 ± 0.005          0.193 ± 0.003       3.203 ± 0.008
                                        5                   2.334 ± 0.006          0.205 ± 0.003       3.224 ± 0.010
                                        6                   2.326 ± 0.005          0.189 ± 0.003       3.20 ± 0.01
                                        7                   2.325 ± 0.006          0.178 ± 0.002       3.19 ± 0.01
                                        8                   2.334 ± 0.005          0.174 ± 0.003       3.19 ± 0.01
Just-in-Time Endian Conversion Method   1                   1.914 ± 0.003          0.582 ± 0.002       3.19 ± 0.05
                                        2                   1.917 ± 0.003          0.328 ± 0.002       2.934 ± 0.009
                                        3                   1.916 ± 0.004          0.242 ± 0.002       2.853 ± 0.008
                                        4                   1.916 ± 0.004          0.205 ± 0.002       2.812 ± 0.009
                                        5                   1.918 ± 0.003          0.205 ± 0.001       2.816 ± 0.009
                                        6                   1.918 ± 0.003          0.1948 ± 0.0009     2.801 ± 0.010
                                        7                   1.917 ± 0.004          0.185 ± 0.001       2.791 ± 0.010
                                        8                   1.917 ± 0.003          0.178 ± 0.003       2.781 ± 0.010

Note. The endian conversion time is included in the FITS reading time for the on ahead endian conversion method.

Fig. 4. The CPU-scalability of the FITS reading and processing time obtained by applying the parallel bit shift algorithm to CFITSIO. I also applied a simple OpenMP parallelization to the summation routine.

Fig. 5. The CPU-scalability of the just-in-time endian conversion algorithm. Endian conversions are performed in the summation routine, to which a simple OpenMP parallelization is applied.

To quantify the performance increase of the on ahead and just-in-time endian conversion methods, I fitted Eq. (3) to the total time of both methods. I found that $\alpha \sim O(10^{-10})$ during the fitting, thus I fixed $\alpha$ at 0. The results are summarized in Table 8 and Figure 6. The rates of performance increase compared to original CFITSIO ($T_{\mathrm{single}}$ = 2.38 ± 0.04 for Machine A and $T_{\mathrm{single}}$ = 3.79 ± 0.03 for Machine B) are also listed in the table.

The figure shows that the above results are explained well by Amdahl's law, and that the on ahead endian conversion method in a single thread has almost the same performance as original CFITSIO. In fact, these two agree with each other within < 5% according to the table. The table also suggests that multi-threading speeds this method up by about 20%. Considering the parallelization rate $P$ ~ 16%, one cannot expect further speed-up by multi-threading for $N$ > 4. This suggests that bottlenecks in other hardware disrupt the order in the instruction pipelines and lead to a decrease in the operating ratio of the ALUs.
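The saturation can be checked numerically with a direct transcription of Amdahl's law. The sketch below is only an illustration; the function names amdahl_time and amdahl_speedup are hypothetical. With P = 0.16 and α = 0, the predicted speed-up is about 1.14 at N = 4 and about 1.16 at N = 8, and it can never exceed 1/(1 − P) ≈ 1.19, which is consistent with the roughly 20% gain quoted above.

```c
/* Amdahl's law with an overhead term: predicted multi-thread time for
 * single-thread time t_single, parallel fraction p, n threads, and
 * parallelization overhead alpha. */
double amdahl_time(double t_single, double p, int n, double alpha)
{
    return (1.0 - p + p / (double)n + alpha) * t_single;
}

/* Speed-up relative to the single-thread time, T_single / T_parallel. */
double amdahl_speedup(double p, int n, double alpha)
{
    return 1.0 / (1.0 - p + p / (double)n + alpha);
}
```

For instance, amdahl_speedup(0.16, 4, 0.0) evaluates 1/0.88, so going from four to eight threads buys only about two more percentage points.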

On the other hand, the just-in-time endian conversion method is, surprisingly, 20% faster than both original CFITSIO and the single-thread version of the on ahead method. It looks as if the endian conversion process had disappeared. In the parallelized case, the just-in-time conversion method is 40% faster than the others in a single thread. However, the performance increase by multi-threading can be expected only for $N$ < 4, since the parallelization rate is $P$ ~ 16% due to the hardware bottlenecks mentioned above.

For further investigation, I fitted the summation time of these methods with Eq. (3) to examine the impact of the endian conversion code in the summation routine on performance; there is endian conversion code in the summation routine in the case of the just-in-time endian conversion method, but not in the case of the on ahead endian conversion method. The results are summarized in Table 9 and Figure 7. I found that the parallelization rate $P$ ≈ 85% in both cases, and that the ratio of $T_{\mathrm{single}}$ of the just-in-time endian conversion method to that of the on ahead one, $r = T_{\mathrm{single}}(\text{Just-in-Time}) / T_{\mathrm{single}}(\text{On Ahead})$, was $r$ = 1.02 ± 0.05 for Machine A and $r$ = 1.03 ± 0.04 for Machine B. There is no overhead of endian conversion in the summation routine, since the deviation of $r$ from unity is not statistically significant.
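The quoted uncertainties of r are reproduced by first-order error propagation applied to the fitted single-thread times in Table 9. The worked calculation below assumes that the two fitted times are uncorrelated; the paper does not state how the errors on r were derived, so this is a consistency check rather than a description of the author's procedure.

```latex
r = \frac{T_{\mathrm{JIT}}}{T_{\mathrm{OA}}}, \qquad
\sigma_r = r\,\sqrt{\left(\frac{\sigma_{\mathrm{JIT}}}{T_{\mathrm{JIT}}}\right)^{2}
                  + \left(\frac{\sigma_{\mathrm{OA}}}{T_{\mathrm{OA}}}\right)^{2}}

\text{Machine A:}\quad r = \frac{430}{420} \approx 1.024, \qquad
\sigma_r \approx 1.024\,\sqrt{\left(\frac{20}{430}\right)^{2} + \left(\frac{1}{420}\right)^{2}} \approx 0.048

\text{Machine B:}\quad r = \frac{570}{551} \approx 1.034, \qquad
\sigma_r \approx 1.034\,\sqrt{\left(\frac{20}{570}\right)^{2} + \left(\frac{6}{551}\right)^{2}} \approx 0.038
```

Rounded to two decimal places, these give 1.02 ± 0.05 and 1.03 ± 0.04, so in both cases r lies within one standard deviation of unity.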

Thus I conclude that endian conversion is such a simple operation for a modern CPU that bottlenecks in other hardware disrupt the order in the instruction pipelines; to prevent this disruption, the endian conversion should be done just before a value is referenced.

5.2. Application to ALMAWebQL 

So far, I have only investigated the performance increase of summing up all the elements in a large FITS file by the just-in-time endian conversion method. In this subsection, I apply the method to ALMAWebQL, our interactive web viewer for ALMA data cubes described in Paper I, to obtain more realistic benchmark data. For a realistic and fair comparison, the SSE2-boosted endian conversion code in CFITSIO is enabled for the on ahead endian conversion method, while there is no SSE2 code in the just-in-time endian conversion method.

ALMA data cubes do not currently contain polarization information, and they are simple 3-dimensional FITS files (Figure 8). For image extraction, one has to integrate the cube along the spectral direction; for spectrum extraction, one convolves all the spatial information.
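The two extraction operations can be combined with just-in-time endian conversion roughly as follows. This is a sketch under several assumptions, not the ALMAWebQL implementation: the pixels are taken to be 64-bit doubles stored in FITS order (the first spatial axis varying fastest and the spectral axis slowest), spectrum extraction is reduced to a plain sum over all spatial pixels of each channel, and the names pixel_value, extract_image, and extract_spectrum are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Just-in-time conversion of one big-endian 64-bit pixel. */
static double pixel_value(uint64_t raw)
{
    uint64_t a = __builtin_bswap64(raw);
    double x;
    memcpy(&x, &a, sizeof x);
    return x;
}

/* Image extraction: integrate the cube along the spectral axis.
 * cube holds nx*ny*nz raw pixels; image receives nx*ny values. */
void extract_image(const uint64_t *cube, size_t nx, size_t ny, size_t nz,
                   double *image)
{
    size_t npix = nx * ny;
    for (size_t p = 0; p < npix; ++p)
        image[p] = 0.0;
    for (size_t k = 0; k < nz; ++k)
        for (size_t p = 0; p < npix; ++p)
            image[p] += pixel_value(cube[k * npix + p]);
}

/* Spectrum extraction: sum every spatial pixel of each channel.
 * spectrum receives nz values. */
void extract_spectrum(const uint64_t *cube, size_t nx, size_t ny, size_t nz,
                      double *spectrum)
{
    size_t npix = nx * ny;
    for (size_t k = 0; k < nz; ++k) {
        double s = 0.0;
        for (size_t p = 0; p < npix; ++p)
            s += pixel_value(cube[k * npix + p]);
        spectrum[k] = s;
    }
}
```

In both routines the cube is read strictly in storage order and is never modified, so each pixel is converted exactly once, at the moment it is accumulated.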





^Hardware bottlenecks are included in the 1 − P term.

Fig. 8. The schematic illustration of a data cube of ALMA Science Verification Data.
Table 8
The fitting results of the total processing time

                                                                                               Increase Rate of Performance
Method                           Machine     T_single (sec)   P               χ² (d.o.f.^a)   Single Thread   Multi-Thread
On Ahead Endian Conversion       Machine A   2.377 ± 0.009    0.164 ± 0.006   0.09 (2)        1.00 ± 0.02     1.20 ± 0.02
                                 Machine B   3.68 ± 0.05      0.16 ± 0.02     49^b (6)        1.03 ± 0.02     1.22 ± 0.03
Just-in-Time Endian Conversion   Machine A   2.02 ± 0.05      0.17 ± 0.03     0.5 (2)         1.18 ± 0.04     1.42 ± 0.07
                                 Machine B   3.13 ± 0.02      0.130 ± 0.009   8.5 (6)         1.21 ± 0.01     1.39 ± 0.02

Note. The results of fitting Amdahl's law to the total processing time of the two endian conversion methods, and their increase rates of performance compared with original CFITSIO (T_single = 2.38 ± 0.04 for Machine A and T_single = 3.79 ± 0.03 for Machine B). The errors are 1-σ confidence limits for a single parameter.

^a Degrees of freedom

^b The large χ² value is caused by very small errors of the observed values, which agree with the model well (see Figure 6).



Table 9
The fitting results of the time to sum up all elements

Method                           Machine     T_single (msec)   P             χ² (d.o.f.)
On Ahead Endian Conversion       Machine A   420 ± 1           0.82 ± 0.04   127.9 (2)
                                 Machine B   551 ± 6           0.81 ± 0.02   491.1 (6)
Just-in-Time Endian Conversion   Machine A   430 ± 20          0.92 ± 0.06   913.4 (2)
                                 Machine B   570 ± 20          0.80 ± 0.02   312.7 (6)

Note. T_single(Just-in-Time) / T_single(On Ahead) = 1.02 ± 0.05 (for Machine A), 1.03 ± 0.04 (for Machine B).
Fig. 6. The fitting results of the on ahead and just-in-time endian conversion methods for the total
processing time with Amdahl's law. The red dashed and blue dash-dotted lines correspond to the law for the on
ahead and just-in-time endian conversion methods, respectively.

Fig. 7. The fitting results of the on ahead and just-in-time endian conversion methods for the summation
time with Amdahl's law. The red dashed and blue dash-dotted lines correspond to the law for the on ahead
and just-in-time endian conversion methods, respectively.

Table 10
Image Extraction Time

File Size (MB)   On Ahead Endian Conversion Method (sec)   Just-in-Time Endian Conversion Method (sec)
21.4             0.02953 ± 0.00006                         0.0361 ± 0.0004
55.0             0.75214 ± 0.00003                         0.0910 ± 0.0001
66.9             0.10007 ± 0.00006                         0.10295 ± 0.00008
78.4             0.13136 ± 0.00008                         0.12371 ± 0.00008
88.7             0.14435 ± 0.00007                         0.1299 ± 0.0001
106.9            0.1835 ± 0.0001                           0.1606 ± 0.0002
150.3            0.2333 ± 0.0001                           0.21006 ± 0.00008

Note. The time to extract an image from an ALMA data cube on Machine A in a single thread.
Table 11
Spectrum Extraction Time

File Size (MB)   On Ahead Endian Conversion Method (sec)   Just-in-Time Endian Conversion Method (sec)
21.4             0.02743 ± 0.00002                         0.0344 ± 0.0004
55.0             0.06970 ± 0.00002                         0.07527 ± 0.00008
66.9             0.08512 ± 0.00002                         0.08583 ± 0.00010
78.4             0.09861 ± 0.00003                         0.0909 ± 0.0001
88.7             0.11271 ± 0.00003                         0.10539 ± 0.00008
106.9            0.13394 ± 0.00004                         0.1153 ± 0.0003
150.3            0.18769 ± 0.00005                         0.15503 ± 0.00006

Note. The time to extract a spectrum from an ALMA data cube on Machine A in a single thread.
Fig. 9. The image (left) and spectrum (right) extraction time with respect to the file size of the ALMA data cube.
The red dashed and blue dash-dotted lines correspond to the best fits for the on ahead and just-in-time
endian conversion methods, respectively.
I measured the time to complete these computations
in a single thread with data of various sizes on
Machine A. The results for image extraction are
summarized in Table 10, and those for spectrum
extraction are summarized in Table 11. From
these tables, I obtain

\[ T(\mathrm{On\ Ahead}) = (1.7 \pm 0.1) \times 10^{-3} \left( \frac{V}{\mathrm{MB}} \right) - (0.013 \pm 0.007)\ \mathrm{sec}, \tag{4} \]

\[ T(\mathrm{Just\text{-}in\text{-}Time}) = (1.27 \pm 0.04) \times 10^{-3} \left( \frac{V}{\mathrm{MB}} \right) + (0.021 \pm 0.003)\ \mathrm{sec} \tag{5} \]

for image extraction and

\[ T(\mathrm{On\ Ahead}) = (1.252 \pm 0.007) \times 10^{-3} \left( \frac{V}{\mathrm{MB}} \right) + (0.0008 \pm 0.0004)\ \mathrm{sec}, \tag{6} \]

\[ T(\mathrm{Just\text{-}in\text{-}Time}) = (0.85 \pm 0.02) \times 10^{-3} \left( \frac{V}{\mathrm{MB}} \right) + (0.028 \pm 0.002)\ \mathrm{sec} \tag{7} \]

for spectrum extraction, where T(On Ahead) and
T(Just-in-Time) represent the time with the on
ahead and just-in-time endian conversion methods,
respectively, and V is the file size in MB
(Figure 9). Hence the just-in-time endian conversion
method in a single thread is > 20% faster than the on
ahead conversion method boosted by SSE2 for
V > 200 MB. This demonstrates that the just-in-time
endian conversion method can be very powerful
when one performs convolution and stacking
of very large images obtained with future large
telescopes; both are very common analysis
techniques in the optical band.
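As a worked check of the quoted speed-up, the two spectrum extraction fits, T(On Ahead) = 1.252 × 10^-3 (V/MB) + 0.0008 sec and T(Just-in-Time) = 0.85 × 10^-3 (V/MB) + 0.028 sec, can be evaluated at V = 200 MB. Only the best-fit values are used here and the quoted errors are ignored, which is a simplification:

```latex
\begin{align*}
T_{\rm OA}(200\ {\rm MB}) &= 1.252 \times 10^{-3} \times 200 + 0.0008 \approx 0.251\ {\rm sec}, \\
T_{\rm JIT}(200\ {\rm MB}) &= 0.85 \times 10^{-3} \times 200 + 0.028 = 0.198\ {\rm sec}, \\
\frac{T_{\rm OA}}{T_{\rm JIT}} &\approx 1.27, \qquad
V_{\rm cross} = \frac{0.028 - 0.0008}{(1.252 - 0.85) \times 10^{-3}}\ {\rm MB} \approx 68\ {\rm MB}.
\end{align*}
```

The break-even size of about 68 MB is consistent with Table 11, where the two methods take nearly the same time at 66.9 MB and the just-in-time method is faster from 78.4 MB upward.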

5.3. Data Types 

In this paper, I treated only double-precision
FITS files, but one could expect almost the same
results for the float and LONG data types, which
correspond to BITPIX = -32 and 32, respectively; as
demonstrated in Appendix C, the bit_shift() function
is compiled into a BSWAP instruction. The
amd64 architecture can handle both 32-bit and
64-bit operation codes and their operands seamlessly.
On the other hand, for the byte and short data
types, there may be little advantage in the just-in-time
endian conversion method, since the BSWAP instruction
cannot take a 16-bit value as its operand, and
up-casting into a 32-bit integer always occurs in
arithmetic operations in both cases.
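A 32-bit variant of the idea might look like the sketch below. It is an illustration only, not code from this paper: bit_shift32, load_float_be, load_long_be and sum_float_be are hypothetical names, and a little-endian host with IEEE 754 floats is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 32-bit counterpart of a bit shift byte swap (hypothetical name). */
static inline uint32_t bit_shift32(uint32_t a)
{
    return (a << 24) | ((a & 0x0000FF00u) << 8) |
           ((a >> 8) & 0x0000FF00u) | (a >> 24);
}

/* BITPIX = -32: reinterpret the swapped word as an IEEE 754 float. */
static inline float load_float_be(const uint32_t *raw, size_t i)
{
    uint32_t bits = bit_shift32(raw[i]);
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* BITPIX = 32: reinterpret the swapped word as a signed 32-bit integer. */
static inline int32_t load_long_be(const uint32_t *raw, size_t i)
{
    uint32_t bits = bit_shift32(raw[i]);
    int32_t value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* Sum float pixels, converting each one just before it is used. */
double sum_float_be(const uint32_t *raw, size_t n)
{
    assert(raw != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += load_float_be(raw, i);
    return sum;
}
```

The same loader pattern serves both 32-bit types; only the final reinterpretation differs, so the swap itself stays a single 32-bit operation.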

6. Summary 

The FITS format was originally developed to
exchange digital astronomical datasets from one
computer to another, but the progress of
computational power and software technology now enables
one to process FITS files through web browsers.
In addition, data sizes have been inflating year by
year, and they will reach the TB scale in the years ahead.
To handle such big FITS files with web applications,
the endian conversion time from the FITS
native byte order to the machine one cannot be
neglected, and a solution to this problem is required.

In this paper, I compared the features of four
typical endian conversion algorithms in a
multi-thread environment, and found that the bit shift one
was suitable for parallelization. Then I examined
the best timing for endian conversion in a
multi-thread environment. I found that one should
postpone the endian conversion until a value is actually
referred to in a program, because endian conversion
is so simple for a modern CPU that the bottlenecks
of other hardware disrupt the order in the instruction
pipelines, which leads to a decrease in the operating
ratio of the ALUs. In fact, by applying this method
to loading a 3.4 GB FITS file and summing up all its
elements, the performance increased by 20% for a
single thread and 40% for multiple threads compared to
CFITSIO, which corresponded to > 600 milliseconds,
a speed-up that one can notice. No
overhead of endian conversion was found in the
summation routine; hence one can sweep the
endian conversion time out of one's code. Note
that the parallelization of this method peaked out at
four threads in the experiment.
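The difference between the two timings can be illustrated with a short single-thread sketch. It is not the benchmark code of this paper: the function names are hypothetical, the buffer is assumed to hold big-endian doubles, and the host is assumed to be little-endian.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bit shift endian conversion of one 64-bit word. */
static inline uint64_t bit_shift(uint64_t a)
{
    return  (a >> 56)
         | ((a >> 40) & 0x000000000000FF00ULL)
         | ((a >> 24) & 0x0000000000FF0000ULL)
         | ((a >>  8) & 0x00000000FF000000ULL)
         | ((a <<  8) & 0x000000FF00000000ULL)
         | ((a << 24) & 0x0000FF0000000000ULL)
         | ((a << 40) & 0x00FF000000000000ULL)
         |  (a << 56);
}

/* Reinterpret 64 bits as a double without violating aliasing rules. */
static inline double as_double(uint64_t bits)
{
    double value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* On ahead: one pass converts the whole buffer, a second pass sums it. */
double sum_on_ahead(uint64_t *raw, size_t n)
{
    assert(raw != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        raw[i] = bit_shift(raw[i]);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += as_double(raw[i]);
    return sum;
}

/* Just-in-time: a single pass; each value is converted when it is read. */
double sum_just_in_time(const uint64_t *raw, size_t n)
{
    assert(raw != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += as_double(bit_shift(raw[i]));
    return sum;
}
```

The just-in-time version touches each element once and leaves the raw buffer unchanged, so the conversion can overlap with the memory traffic of the summation instead of adding a separate pass.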

CPU vendors introduce various techniques,
such as speculative execution and branch
prediction, to improve the efficiency of instruction
pipelines; as a result, the executed instruction
sequence differs from the programmed one. In this
context, modern CPUs partially break "causality",
that is, the programmed instruction sequence, and gain
speed. The just-in-time endian conversion method
utilizes such boosting technology. There is nothing
new in the method, but it must be a small step
toward handling the astronomical big data generated by the
next generation telescopes.

I greatly appreciate Dr. Chisato Yamauchi,
who is my colleague and the author of
SLLIB/SFITSIO, for rewarding discussions.



SFITSIO is a lightweight FITS library for C/C++,
providing modern APIs.
A. Tmpfs and Ramdisk

Both tmpfs and ramdisk are data spaces allocated in memory. One has to specify the size of a ramdisk in
advance, while one does not necessarily have to set the size of tmpfs in advance, since it is under the control
of the virtual memory manager and shares the swap space.

When an application requests memory blocks from the operating system and there is not sufficient physical
memory left, the memory manager first swaps out the files on tmpfs. Tmpfs is an ideal
space for temporary files to which one requires very fast access.



B. Another Implementation of Byte Shuffle Algorithm 

One can also implement the byte shuffle algorithm as follows:

#include <stdint.h>

uint64_t byte_shuffle2(uint64_t a)
{
    unsigned char *p = (unsigned char *)&a;   /* bytes of the input  */
    uint64_t b;
    unsigned char *q = (unsigned char *)&b;   /* bytes of the output */

    /* copy the eight bytes in reverse order */
    q[0] = p[7];
    q[1] = p[6];
    q[2] = p[5];
    q[3] = p[4];
    q[4] = p[3];
    q[5] = p[2];
    q[6] = p[1];
    q[7] = p[0];

    return b;
}

The number of assignments in this code (= 8) is less than that in the code shown in the main part of this
paper (= 12), and one would expect a further performance improvement.

I disassembled both codes compiled with the -O2 option, and obtained the following:

0000000000000000 <byte_shuffle>:
   0:   49 89 fa                mov    %rdi,%r10
   3:   49 89 f8                mov    %rdi,%r8
   6:   89 fe                   mov    %edi,%esi
   8:   48 89 f9                mov    %rdi,%rcx
   b:   89 fa                   mov    %edi,%edx
   d:   48 89 f8                mov    %rdi,%rax
  10:   40 88 7c 24 ff          mov    %dil,-0x1(%rsp)
  15:   48 c1 e8 20             shr    $0x20,%rax
  19:   49 c1 ea 38             shr    $0x38,%r10
  1d:   49 c1 e8 30             shr    $0x30,%r8
  21:   66 c1 ee 08             shr    $0x8,%si
  25:   48 c1 e9 28             shr    $0x28,%rcx
  29:   c1 ea 10                shr    $0x10,%edx
  2c:   c1 ef 18                shr    $0x18,%edi
  2f:   44 88 54 24 f8          mov    %r10b,-0x8(%rsp)
  34:   40 88 74 24 fe          mov    %sil,-0x2(%rsp)
  39:   44 88 44 24 f9          mov    %r8b,-0x7(%rsp)
  3e:   88 54 24 fd             mov    %dl,-0x3(%rsp)
  42:   88 4c 24 fa             mov    %cl,-0x6(%rsp)
  46:   40 88 7c 24 fc          mov    %dil,-0x4(%rsp)
  4b:   88 44 24 fb             mov    %al,-0x5(%rsp)
  4f:   48 8b 44 24 f8          mov    -0x8(%rsp),%rax
  54:   c3                      retq

and

0000000000000000 <byte_shuffle2>:
   0:   49 89 fa                mov    %rdi,%r10
   3:   49 89 f9                mov    %rdi,%r9
   6:   49 89 f8                mov    %rdi,%r8
   9:   48 89 fe                mov    %rdi,%rsi
   c:   89 f9                   mov    %edi,%ecx
   e:   89 fa                   mov    %edi,%edx
  10:   89 f8                   mov    %edi,%eax
  12:   49 c1 ea 38             shr    $0x38,%r10
  16:   49 c1 e9 30             shr    $0x30,%r9
  1a:   66 c1 e8 08             shr    $0x8,%ax
  1e:   49 c1 e8 28             shr    $0x28,%r8
  22:   48 c1 ee 20             shr    $0x20,%rsi
  26:   c1 e9 18                shr    $0x18,%ecx
  29:   c1 ea 10                shr    $0x10,%edx
  2c:   44 88 54 24 f8          mov    %r10b,-0x8(%rsp)
  31:   44 88 4c 24 f9          mov    %r9b,-0x7(%rsp)
  36:   44 88 44 24 fa          mov    %r8b,-0x6(%rsp)
  3b:   40 88 74 24 fb          mov    %sil,-0x5(%rsp)
  40:   88 4c 24 fc             mov    %cl,-0x4(%rsp)
  44:   88 54 24 fd             mov    %dl,-0x3(%rsp)
  48:   88 44 24 fe             mov    %al,-0x2(%rsp)
  4c:   40 88 7c 24 ff          mov    %dil,-0x1(%rsp)
  51:   48 8b 44 24 f8          mov    -0x8(%rsp),%rax
  56:   c3                      retq





That is, there are fewer assignments in byte_shuffle2() (= 8) than in byte_shuffle() (= 12); however, the
binary code of the former (87 bytes) is longer than that of the latter (85 bytes). Hence one cannot expect
a further performance gain from this code.



C. Bit Shift Algorithm and BSWAP Instruction 

The bit shift endian conversion code is actually identical to the BSWAP instruction when compiled with the
optimization option -O2. The disassembled code obtained with the objdump -d command is below:

0000000000000000 <bit_shift>:
   0:   48 89 f8                mov    %rdi,%rax
   3:   48 0f c8                bswap  %rax
   6:   c3                      retq
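For reference, a shift-and-mask byte swap of the kind that compilers such as GCC can recognize and reduce to a single BSWAP is sketched below. This is a typical formulation written for illustration; the bit_shift() source used in this paper is given in the main text and may differ in detail. The helper names bswap_reference and check_bit_shift are hypothetical, and __builtin_bswap64 is the GCC/Clang builtin, used here only as a reference result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift-and-mask byte reversal of a 64-bit word. */
uint64_t bit_shift(uint64_t a)
{
    return ((a & 0x00000000000000FFULL) << 56) |
           ((a & 0x000000000000FF00ULL) << 40) |
           ((a & 0x0000000000FF0000ULL) << 24) |
           ((a & 0x00000000FF000000ULL) <<  8) |
           ((a & 0x000000FF00000000ULL) >>  8) |
           ((a & 0x0000FF0000000000ULL) >> 24) |
           ((a & 0x00FF000000000000ULL) >> 40) |
           ((a & 0xFF00000000000000ULL) >> 56);
}

/* Reference result: the compiler builtin where available, else a byte loop. */
uint64_t bswap_reference(uint64_t a)
{
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap64(a);
#else
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (r << 8) | (a & 0xFFu);
        a >>= 8;
    }
    return r;
#endif
}

/* Self-check over a few bit patterns; aborts if the two ever disagree. */
void check_bit_shift(void)
{
    const uint64_t samples[] = {
        0x0ULL, 0x1ULL, 0x0102030405060708ULL, 0xFFFFFFFF00000000ULL
    };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(bit_shift(samples[i]) == bswap_reference(samples[i]));
}
```

Whether a given compiler actually emits BSWAP for this source depends on its version and options, so the disassembly should be inspected with objdump -d rather than assumed.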



D. Endian Conversion Algorithms and Memory Alignment 

It is ensured that the leading memory address (alignment) of an array is always a multiple of 16 (16-byte
alignment) in the amd64 architecture. However, if one would like to read a file in multiple threads, one has to
make a copy of the file image in memory. In such a case, the alignment is not always 16-byte. Thus I
performed the benchmark described in §3.4 (single thread case) but made the alignment of the array a random
number.

For the benchmark, I replaced _mm_load_si128() and _mm_store_si128() in the SSE2 code with
_mm_loadu_si128() and _mm_storeu_si128(), respectively, to make the code operable. The results are
summarized in Table 12.

The trend found in §3.4.1 is roughly true in this case, though all the algorithms are slightly slower (within
a few %) than in the 16-byte alignment case. Hence one does not have to be concerned about memory alignment.
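An SSE2 byte swap that tolerates arbitrary alignment could be sketched as follows. The SSE2 code benchmarked in this paper is not reproduced in this appendix, so the routine below is an assumption about one reasonable implementation: it uses the real intrinsics _mm_loadu_si128() and _mm_storeu_si128(), while the function names are hypothetical. A scalar fallback keeps it compilable where SSE2 is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>

/* Byte-swap both 64-bit lanes of an SSE2 register. */
static inline __m128i swap64x2_sse2(__m128i v)
{
    /* swap the two bytes inside every 16-bit word */
    v = _mm_or_si128(_mm_slli_epi16(v, 8), _mm_srli_epi16(v, 8));
    /* reverse the four 16-bit words of the low and the high 64-bit half */
    v = _mm_shufflelo_epi16(v, _MM_SHUFFLE(0, 1, 2, 3));
    v = _mm_shufflehi_epi16(v, _MM_SHUFFLE(0, 1, 2, 3));
    return v;
}
#endif

/* Scalar byte swap used for the tail (and when SSE2 is absent). */
static inline uint64_t swap64_scalar(uint64_t v)
{
    v = (v << 32) | (v >> 32);
    v = ((v & 0x0000FFFF0000FFFFULL) << 16) | ((v >> 16) & 0x0000FFFF0000FFFFULL);
    v = ((v & 0x00FF00FF00FF00FFULL) << 8) | ((v >> 8) & 0x00FF00FF00FF00FFULL);
    return v;
}

/* Byte-swap n 64-bit elements in place; buf may have any alignment. */
void swap64_buffer_unaligned(void *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    unsigned char *p = (unsigned char *)buf;
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 2 <= n; i += 2) {
        /* unaligned load and store: no 16-byte alignment is required */
        __m128i v = _mm_loadu_si128((const __m128i *)(p + 8 * i));
        _mm_storeu_si128((__m128i *)(p + 8 * i), swap64x2_sse2(v));
    }
#endif
    for (; i < n; i++) {
        uint64_t w;
        memcpy(&w, p + 8 * i, sizeof w);
        w = swap64_scalar(w);
        memcpy(p + 8 * i, &w, sizeof w);
    }
}
```

The aligned intrinsics fault on addresses that are not multiples of 16, which is why the unaligned variants are the ones that remain operable on a randomly aligned copy.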



Table 12
The endian conversion time in the case that memory alignment is random

Machine     Bit Shift (msec)   BSWAP (msec)   SSE2 (msec)   SSSE3 (msec)   Byte Shuffle (msec)
Machine A   413 ± 2            416 ± 3        416 ± 5       377 ± 2        3220 ± 10
Machine B   624 ± 7            641 ± 9        602 ± 9       590 ± 3        8320 ± 90

Note. The endian conversion time of 423,414,686 (= 29,566 × 14,321) double-type elements with various algorithms in the case that
memory alignment is random.

E. The Patches for CFITSIO

E.1. fitsio2.h

*** fitsio2.h.org 2013-03-08 14:19:49.560538980 +0900
--- fitsio2.h 2013-01-15 14:43:10.000000000 +0900
*** 96,102 ****
  #elif defined(__ia64__) || defined(__x86_64__)
  /* Intel itanium 64-bit PC, or AMD opteron 64-bit PC */
! #define BYTESWAPPED TRUE
  #define LONGSIZE 64

  #elif defined(_SX) /* Nec SuperUx */

--- 96,103 ----
  #elif defined(__ia64__) || defined(__x86_64__)
  /* Intel itanium 64-bit PC, or AMD opteron 64-bit PC */
! /* #define BYTESWAPPED TRUE */
! #define BYTESWAPPED FALSE
  #define LONGSIZE 64

  #elif defined(_SX) /* Nec SuperUx */

*** 169,175 ****
  /* generic 32-bit IBM PC */
  #define MACHINE IBMPC
! #define BYTESWAPPED TRUE

  #elif defined(__arm__)

--- 170,178 ----
  /* generic 32-bit IBM PC */
  #define MACHINE IBMPC
! /* #define BYTESWAPPED TRUE */
! #define BYTESWAPPED FALSE

  #elif defined(__arm__)

E.2. cfileio.c

*** cfileio.c.org 2013-03-08 14:20:09.052539296 +0900
--- cfileio.c 2013-01-16 19:57:46.000000000 +0900
*** 3763,3769 ****
      }

      /* test for correct byteswapping. */
!
      u.ival = 1;
      if ((BYTESWAPPED && u.cval[0] != 1) ||
          (BYTESWAPPED == FALSE && u.cval[1] != 1) )

--- 3763,3769 ----
      }

      /* test for correct byteswapping. */
! /*
      u.ival = 1;
      if ((BYTESWAPPED && u.cval[0] != 1) ||
          (BYTESWAPPED == FALSE && u.cval[1] != 1) )

*** 3776,3782 ****
          FFUNLOCK;
          return(1);
      }
!

      /* test that LONGLONG is an 8 byte integer */

--- 3776,3782 ----
          FFUNLOCK;
          return(1);
      }
! */

      /* test that LONGLONG is an 8 byte integer */
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